The following notebook demonstrates the marginalia determination process. The purpose of this script is to prepare crop metadata for the OCR process that follows. Though this notebook displays a number of cropped and manipulated page images, in practice images like these are not stored. The crop metadata generated as output by this script is used during the OCR process to determine which areas of text should be fed to the OCR software. The background color metadata is used to produce clean margins for cropped images that will enable cleaner OCR output.
A small subset of example images and their associated metadata have been included for demonstration purposes.
marginalia_determination serves to identify which area of the original contains only the main body text. This process involves a number of sub-processes, including the trimming of blank margins, the removal of headers from most pages, and the separation of main body text from the marginalia.
The general flow of marginalia_determination can be summarized as follows:

1. Determine each page's angle of skew from the horizontal.
2. Trim blank margins and remove headers from most pages, producing a cropped, binarized image.
3. Split the cropped image into horizontal bands and determine a "cut" line separating main body text from marginalia.
4. Map the resulting crop back onto the original image and store the final bounding box along with other metadata.
For details regarding each subprocess, see explanations below. Examples comparing results from each step will be provided at the end of this notebook.
import sys
import os
import csv
import shutil
import time
import random
from collections import Counter
from PIL import Image, ImageChops, ImageStat
import numpy as np
from scipy.ndimage import interpolation as inter
First we append our current directory to the path so that Python knows to check there for our own local modules (for this example we'll need 'cropfunctions.py'). The 'example_image.py' file contains scripts used only for demonstration purposes in this notebook; it is not part of the production code.
sys.path.append(os.path.abspath("./"))
from cropfunctions import *
The following code block extracts metadata for each page that is required to begin the marginalia determination process. The 'SampleMetadata' file in this example represents a small subset of a larger file that holds metadata for all images. This metadata was compiled from the following sources:
In the file, each row represents a single page leaf from the original scans. We extract the following information for each and compile a list of dictionaries, each with the following information:

- "filename": the name of the image file (with the ".jp2" extension appended)
- "side": the side of the open book on which the page appears ("left" or "right")
- "folder": the name of the folder that contains the image file
- "type": the section type of the page
- "start_section": whether the page begins a new section (initialized to False)
Note: The "filename" and "folder" fields are named according to the folder structure established for our set of images (i.e. "lawsresolutionso1891nort_0697.jp2" and "lawsresolutionso1891nort_jp2"). If used with a different set of images, the procedures in the following loop will need to be adjusted.
Once the list of metadata dictionaries (one for each row) has been assembled, the pages are sorted according to their filepath strings.
master = []
with open("SampleMetadata.csv", "r") as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        row_dict = dict()
        row_dict["filename"] = row["filename"] + ".jp2"
        row_dict["side"] = row["handSide"].lower()
        row_dict["folder"] = row["filename"].split("_")[0] + "_jp2"
        row_dict["type"] = row["sectiontype"]
        row_dict["start_section"] = False
        master.append(row_dict)
master = sorted(master, key=lambda i: i["filename"])
#Each "row" will look as follows:
print(master[0])
Once we have extracted the required metadata, we determine which pages mark the beginning of a new "type." This is important because section "start" pages often contain large headers with important metadata. The steps below remove all headers except those from "start_section" pages so that we can retain the information they contain. To do so we iterate through all pages in our 'master' list. If it is found that a given page "type" is different from the page immediately before it, the page in question is designated as a "start" page. This assumes that no sections of identical "types" will follow one another consecutively.
Our example images are of two different "types."
for k in range(1, len(master)):
    if master[k]["type"] != master[k-1]["type"]:
        master[k]["start_section"] = True
#start section values for sample images
print([i["start_section"] for i in master])
In order to better illustrate the functionality of this module, the following code blocks walk through the marginalia determination process for three example images. In practice this code would exist as part of a loop to enable large batch processing. The 'meta' list below would normally house the metadata for an entire batch of image transformations. For this demonstration, it will ultimately hold a list of metadata for our three images.
meta = []
To begin, we open the image file associated with each row of metadata. Normally this happens for each image at the beginning of the aforementioned loop.
It is important to note that our original files from the Internet Archive were formatted as JPEG 2000 (.jp2) files. The files have been converted to JPGs in this example in order to increase accessibility across systems with standard versions of PIL. In the production code, we simply open the file located at path f in the code below.
f = os.path.join(master[0]["folder"],master[0]["filename"])
#original:
#orig = Image.open(f)
#.jpg conversion for demonstration purposes
orig1 = Image.open(f.replace(".jp2", ".jpg"))
Below are our three example images pasted side by side. These images were selected to demonstrate procedural differences for "section_start" pages as well as the de-skewing process.
Next we set the page's "side" value. Again, this represents the side of an open book on which one would find the page in question. This piece of metadata is important because it will indicate the side of the page on which we can expect to find marginalia.
side1 = master[0]["side"]
The text in scanned document pages will usually appear slightly rotated from the horizontal (see orig2 above). In order to produce the best OCR results for our pages, we'll need to rotate each image and "straighten" our lines of text. To do so we use the rotation_angle function from cropfunctions.py. This function determines the original image's angle of rotation from the horizontal, which will be used later to "de-skew" our image.
The rotation_angle function accepts an original image variable as an argument. The image is then converted to a binarized numpy pixel array (bin_img). The function then defines a list of angles from -1 to 1 in intervals of .25 degrees. The loop iterates through this list, sending each angle along with bin_img to another crop function, find_score.
find_score evaluates rotations of bin_img according to each of the angles in the above list. It does so first by rotating the pixel array by the given angle. The rotated array (data) is then converted to a vertical histogram of pixel 'counts'. These values represent the number of non-zero pixels found along the horizontal axis for a given vertical axis coordinate. The function then calculates score, which is maximized for histograms with the most definitive 'peaks' and 'valleys'. A histogram with the most notable 'peaks' and 'valleys' is indicative of a text image with lines that run parallel to the horizontal axis. Thus, rotation angles that achieve higher scores have achieved more success in 'straightening' the original image.
After finding scores for each angle in the list, rotation_angle returns the angle with the highest score: the angle that will best straighten an image given its particular degree of skew.
Note - the de-skewing operation was derived from this guide.
ang1 = rotation_angle(orig1)
#Demo: Our first example image did not require any de-skewing.
print(ang1)
Once we have determined an appropriate "de-skew" angle for the page, we determine the image's neutral background color, convert the image to a binary color scheme, and crop the image to the text itself. These steps are accomplished using the trim function from cropfunctions.py, which returns the following values:

- diff: the cropped, binarized version of the original image
- background: the median background color of the original image
- orig_bbox: the coordinates of the cropped area (diff) within the original image
The "find_top" parameter of the trim function instructs the function to crop away the header section of each page if True (the default value). This retains only the body text and accompanying marginalia for most pages. However, for section start pages, the large section headers contain useful metadata. For these pages we set the find_top parameter to False so the trim function will return a cropped image that retains this important information.
Below is a detailed walkthrough of the trim process for one of our example images.
#example arguments
img=orig1
angle=ang1
buff=10
find_top=True
# Set the width and the height according to the image's dimensions.
width, height = img.size
# Establish the median pixel value for your image.
background = tuple(ImageStat.Stat(img).median)
# Create a new image with the same size/mode as the original.
# Color the image according to the "background" value above.
bg = Image.new(mode = img.mode, size = img.size, color = background)
# This PIL function returns the absolute value of the pixel-by-pixel
# difference between two images. The effect here is to generate
# a high-contrast, "negative" version of the original image
diff = ImageChops.difference(img, bg)
# This step creates a binarized version of diff. The background color
# is set to black (0), while the text appears white (1). This step
# enables easier determination of text/non-text areas for later steps
# in the marginalia determination process.
# In order to binarize our image, we first determine an "offset" value
# for the step below. This value corresponds to 3 times the std.
# deviation of the RGB color channel with the greatest amount of
# variation for the image in question. 3 was chosen here as an
# arbitrary value that works with our set of images. This value
# may require adjustment for use with other collections.
offset = round(max(ImageStat.Stat(diff).stddev) * 3)
# This effectively "brightens" all of the text values so
# that they can be more easily distinguished from background "noise"
# pixel values during the binary conversion process.
# The "convert" function creates a new image with binary pixel values
# and very little "noise."
diff = ImageChops.add(diff, diff, 2.0, - offset).convert("1")
# rotate the image
if angle != 0:
    diff = diff.rotate(angle)
# calculate the bounding box of the
# non-zero (i.e. text) areas of the image
bbox = diff.getbbox()
# If our page is a "start section" page, this step can be skipped.
# The above step will give us a bbox that encompasses all text
# areas of the page, including the main body, the header,
# and any marginalia. However, if the page is not a "section start"
# page, we remove the header, the top margin, and the bottom
# margin from diff.
imglist = []
if find_top:
    top = 0
    # The effect of this is to iterate through increasingly
    # deep upper bands of diff starting with the top of the header.
    # Once the .getbbox() function stops finding new non-zero values
    # below a certain point within each new, deeper band,
    # we can interpret that point (top) as the bottom boundary
    # of the header text. It is called "top" because it designates
    # the top of the non-header area of the page.
    for k in range(50, 361, 30):
        try:
            band = diff.crop(bbox).crop((round(width/5), 0, round(4*width/5), k))
            imglist.append(band)
            h = band.getbbox()[3]
            if top == h and h >= 20:
                # check h >= 20 to avoid small watermarks; characters are
                # usually taller than 20 pixels
                break
            elif h != k:
                top = h
        except:
            pass
    # Convert the bbox tuple to a list
    # so that it can be more easily edited
    bbox = list(bbox)
    # Set bbox's upper boundary to "top",
    # which is the bottom of the header.
    # Add buffer to make sure no header pixels are included
    bbox[1] += top + buff
    # Determine the text-containing areas of the new bbox,
    # from which we just removed the header
    bbox1 = diff.crop(tuple(bbox)).getbbox()
    # "combine_bbox" maps changes made to a cropped version
    # of an image back onto the original version of
    # that image. Here, the effect is to translate the cuts we've
    # made to diff (removing the header) into a bbox that contains
    # only the main text area of the original image (orig1)
    bbox = combine_bbox(bbox, bbox1)
# add a buffer (pre-set) on all sides of bbox,
# ensuring that no body text areas have been left out.
bbox = buffer_bbox(bbox = bbox, buff = buff, width = width, height = height)
# This conditional statement accounts for situations in
# which the "offset" value determined during the
# binarization process is so high that all pixels in the image
# are reduced to black (0). If this is the case, no bbox can be found.
# Because offset is determined using the std. deviation of
# image pixel values, such a situation occurs rarely, if ever.
if bbox:
    print(diff.crop(bbox), background, bbox)
else:
    print("None")
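For reference, the buffering step used above can be approximated in a few lines. buffer_bbox_sketch is a hypothetical stand-in for the buffer_bbox function in cropfunctions.py, written from the behavior described in the comments; the production implementation may differ:

```python
def buffer_bbox_sketch(bbox, buff, width, height):
    # Expand the bbox by `buff` pixels on each side, clamping the result
    # to the image boundaries so the buffered box stays on the page.
    left, top, right, bottom = bbox
    return (max(left - buff, 0), max(top - buff, 0),
            min(right + buff, width), min(bottom + buff, height))
```

The clamping matters near page edges: a bbox that already touches the boundary simply keeps that boundary rather than spilling outside the image.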
Now we perform the above steps using our example images.
if master[0]["start_section"]:
    diff1, background1, orig_bbox1 = trim(orig1, angle=ang1, find_top=False)
else:
    diff1, background1, orig_bbox1 = trim(orig1, angle=ang1)
Below are the diff images for each of our three example pages. The middle page has retained a large area of its bottom margin because of an imperfection on the page itself (see the black ink dot in the bottom right margin area of the second image above). This was mistaken for "text" by the .getbbox() function in the above steps for this particular image.
Once we have cropped the image only to include areas with text (without headers for most pages, with headers if the page is a "section start" page), the next step is to determine a dividing line to separate body text from marginalia. Once this line has been determined, we can produce a new set of coordinates that will crop all but the main body text from our original file.
Note: Starting in the 1950s, marginalia was eliminated from the print versions of laws. As a result, image files from these decades do not need to undergo the cutting process. The "orig_bbox" coordinates from the above section are adequate for determining the coordinates for the main body of text because there is no marginalia to eliminate. As a result, the production version of this module features the following code:
if "196" in master[0]["folder"] or "195" in master[0]["folder"]:
    total_bbox1 = orig_bbox1
    cut1 = None
Despite the above caveat, most documents in our corpus will undergo the following cutting procedure.
The first step is to split our diff image into horizontal bands. This is accomplished by the get_bands function from cropfunctions.py. In some ways, we are placing each page into a digital 'paper shredder' that will separate it into a number of horizontal strips.
The get_bands function splits diff into horizontal bands with heights (in pixels) of bheight. In our case, bheight is set to 50 because that is the approximate height of a line of text. The function iterates top to bottom through all bands. Within each 50-pixel-high band of diff, a smaller boundary box is determined to encompass only those areas of the band that contain non-zero pixel values (i.e. text). The third parameter of this function, rd (for "round"), is not passed in the call below and defaults to 20.
The function returns a dictionary that contains:

- a list of dictionaries (band_bboxes), one for each band. Each band dictionary contains values for index, raw, and round.
- the rd value used by the function.

The keys in "band_bboxes" correspond to the following information:

- index: the band's position within the page, counting from the top
- raw: the unmodified bounding box of the band's text area
- round: the band's bounding box, rounded to the nearest rd pixels for the left and right coordinates

For most pages, many of the bands containing only main body text will be approximately the same width. Bands with indented text will be shorter, while bands with marginalia will be longer. We determine "rounded" versions of each band_bbox to reduce the amount of variation amongst the outer edge values of all bands. This will be important in determining the "cut" line between marginalia and main-body text.
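A simplified sketch of the banding step is shown below. get_bands_sketch is a hypothetical stand-in written from the description above; the actual get_bands in cropfunctions.py may differ in detail (for instance, in how empty bands or rounding are handled):

```python
import numpy as np
from PIL import Image

def get_bands_sketch(diff, bheight=50, rd=20):
    # Slice the binarized page image into horizontal bands of height
    # `bheight` and record a bounding box for the text in each band.
    width, height = diff.size
    band_bboxes = []
    for i, top in enumerate(range(0, height, bheight)):
        band = diff.crop((0, top, width, min(top + bheight, height)))
        bbox = band.getbbox()  # None when the band holds no text pixels
        if bbox:
            # Express the band's bbox in full-page coordinates
            raw = (bbox[0], bbox[1] + top, bbox[2], bbox[3] + top)
            # Round the left/right edges to the nearest multiple of `rd`
            rounded = (round(raw[0] / rd) * rd, raw[1],
                       round(raw[2] / rd) * rd, raw[3])
            band_bboxes.append({"index": i, "raw": raw, "round": rounded})
    return {"band_bboxes": band_bboxes, "rd": rd}
```

Running this on a synthetic page with a single block of "text" yields one dictionary per non-empty band, with the rounded left/right edges snapped to multiples of rd.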
bheight = 50
band_dict1 = get_bands(diff1, bheight=bheight)
The example images below demonstrate the "banding" process. It is important to note that the images below show the bands superimposed on the original images rather than their diff versions. In reality, get_bands works with diff and not with orig.
Once we have "banded" our image, we can determine the line that separates the marginalia from the main body text. This is accomplished using the simp_bd function from cropfunctions.py. This function looks at the outer edge of all bands to determine an appropriate horizontal coordinate pixel value to act as a "cut" line.
This function accepts as arguments band_dict from the previous step, the binarized, cropped version of the original image (diff), the side value, the width value (derived from diff), pad, and frequency. The pad argument increases or reduces the cut location to account for aggressively skewed images, while the frequency argument sets a threshold at which the cut coordinate is automatically determined because more than this proportion of outer band edges shared the same value.
In a basic sense, the operation of simp_bd consists of:

1. Collecting the outer edge values of all rounded band bboxes.
2. Filtering those values down to an allowable range between 10% and 30% of the page width from the page's outer edge.
3. Selecting the most frequent remaining value as the "cut" if its share of candidates exceeds freq; otherwise choosing the candidate that best separates dense text areas from sparse margin areas.
4. Falling back to the page's outer edge if no viable candidates exist.
The value returned by simp_bd is used to determine the outer edge of the bbox containing only the main body text with no marginalia.
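Before working through the full function, the core majority-vote step can be sketched in isolation. majority_cut_sketch is a hypothetical, simplified stand-in: it omits the one-off filtering and the pixel-density tie-breaking that the real simp_bd performs, and simply falls back to the page edge when no value dominates:

```python
from collections import Counter

def majority_cut_sketch(edges, side, width, pad=10, freq=0.8):
    # Keep only edges within the plausible marginalia zone: between 10%
    # and 30% of the page width from the outer edge of the page.
    if side == "right":
        cands = [e for e in edges if width * 0.7 < e < width * 0.9]
    else:
        cands = [e for e in edges if width * 0.1 < e < width * 0.3]
    # No viable candidates: fall back to the outer edge of the page.
    if not cands:
        return width if side == "right" else 0
    # If one edge value clearly dominates, pad it outward and clamp it
    # to the page boundaries; otherwise fall back (tie-breaking omitted).
    val, count = Counter(cands).most_common(1)[0]
    if count / len(cands) > freq:
        return min(val + pad, width) if side == "right" else max(val - pad, 0)
    return width if side == "right" else 0
```

For a right-hand page 1000 pixels wide whose bands mostly share an outer edge at x = 750, the sketch returns 760 (the dominant edge plus padding).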
Below is a detailed walkthrough of simp_bd for our third example image:
#Set example arguments
band_dict = band_dict1
diff = diff1
side = side1
width = diff1.size[0]
pad = 10
allow = (0.1, 0.3)
freq = 0.8
minfreq = 0.1
# create a list of the rounded band bboxes
bands = [v["round"] for v in band_dict["band_bboxes"]]
# create a list of the raw band bboxes
raw = [v["raw"] for v in band_dict["band_bboxes"]]
# get the round ("rd") value from the band_dict in question.
rd_dist = band_dict["rd"]
# This assigns integer values to each string value of "side".
# This is important because it tells the function
# which edge in the band bboxes to consider as the "outer" edge,
# i.e. the side that will be adjacent to any marginalia.
# Bboxes indicate L, T, R, B, so left (L) is set to 0,
# while right (R) is set to 2
side_dict = {"left": 0, "right": 2}
# strip_list returns a filtered list of all L/R values
# depending on the side you send it
# This list has been filtered for any one-off values
# to avoid bad cut candidates
bands = strip_list(bands, side_dict[side])
# Set minimum and maximum allowable pixel values based on the
# side & width of the page. This is done by determining the range
# of pixel values for a given page that lies between a distance of
# .1 and .3 times (see "allow" variable) the total width of the page
# from the outside edge of the page. These values are set because
# most lines that divide main body from marginalia will appear in
# this area of a given page. As a result, bands with L or R edges
# that are unusually close to or far from the outside edge of
# a page are eliminated from contention as candidates for cutting
try:
    if side == "right":
        pixmin = width - allow[1] * width
        pixmax = width - allow[0] * width
        newbands = [b for b in bands if b > pixmin and b < pixmax]
        rev = True
    elif side == "left":
        pixmin = allow[0] * width
        pixmax = allow[1] * width
        newbands = [b for b in bands if b > pixmin and b < pixmax]
        rev = False
    # Once we have determined a list of viable cut candidates within
    # our allowable range of outer band edges (newbands), we check
    # to see if that list contains any values. If it doesn't, and we were
    # unable to find any bands that had edges within our allowable range,
    # we set our cut value to the outside edge of the original cropped image
    # (0 for Left, width for Right). However, if newbands
    # is NOT empty, we perform the following operation:
    if newbands:
        # get counts for all of the candidate edge values
        # the reverse parameter sorts values according to
        # how close they are to a page's side:
        # low-high for left, high-low for right
        ct = Counter(sorted(newbands, reverse=rev))
        # determine which allowable edge value is most frequent.
        # If that value represents more than `freq` of all values
        # in newbands (i.e. allowable values), set cut to that value.
        # If "cut" combined with "pad" (see above) is within
        # the total bounds of our original cropped image (diff),
        # then we take that value as our final "cut" value
        if ct.most_common(1)[0][1] / sum(ct.values()) > freq:
            cut = ct.most_common(1)[0][0]
            if side == "right":
                cut = min(cut + pad, width)
            elif side == "left":
                cut = max(cut - pad, 0)
        # If none of the viable cut candidates show a clear majority,
        # we choose the best candidate: this loop first filters our
        # set of edge candidates down to those which appear more often than
        # our minimum frequency value (minfreq).
        # Once this smaller subset has been determined, we iterate through
        # each remaining candidate and create a sample "cut".
        # For each "cut", we determine the pixel density for
        # the resulting marginalia bbox ("margin_bbox") and main
        # text bbox ("text_bbox"). The "best" candidate is that which
        # maximizes the difference between these two values.
        else:
            m = [k for k in ct.keys() if ct[k] > len(newbands) * minfreq]
            cut_dict = {k: 0 for k in m}
            if len(m) > 1:
                for cut in m:
                    text_bbox = [0, 0] + list(diff.size)
                    text_bbox[side_dict[side]] = cut
                    other = abs(side_dict[side] - 2)
                    text_bbox[other] = (1 - side_dict[side]) * 50 + cut
                    margin_bbox = [0, 0] + list(diff.size)
                    margin_bbox[side_dict[side]] = (side_dict[side] - 1) * 50 + cut
                    margin_bbox[other] = cut
                    mean_diff = np.mean(diff.crop(text_bbox)) - np.mean(diff.crop(margin_bbox))
                    cut_dict[cut] += mean_diff
                cutrange = max(cut_dict, key=cut_dict.get)
            elif len(m) == 1:
                cutrange = m[0]
            # Generate a list of candidate cuts that fall
            # within an acceptable "cutrange"
            pix_list = [i[side_dict[side]] for i in raw if abs(i[side_dict[side]] - cutrange) < rd_dist / 2]
            # If the page is a "right" page, select the remaining
            # candidate that is furthest to the right. If that
            # cut is outside the original page width,
            # set the cut equal to the image width
            if side == "right":
                cut = min(max(pix_list) + pad, width)
            # If the page is a "left" page, select the remaining
            # candidate that is furthest to the left.
            # If that cut is less than zero, set the cut equal
            # to zero to avoid cutting outside of the page boundaries.
            elif side == "left":
                cut = max(min(pix_list) - pad, 0)
    else:
        # If no cut was determined, set cut equal to the edge of the page.
        if side == "right":
            cut = width
        elif side == "left":
            cut = 0
    # Print statements used here instead
    # of return statements for demonstration purposes.
    print("return cut:", cut)
except:
    if side == "left":
        print("return 0:", 0)
    elif side == "right":
        print("return width:", width)
width1 = diff1.size[0]
cut1 = simp_bd(band_dict=band_dict1, diff=diff1, side=side1, width=width1,
               pad=10, freq=0.9)
Recall that our "banding" and "cutting" procedures have used diff, the cropped and binarized version of our original image. The following pieces of code allow us to determine the main-body-text bbox from orig, the original version of our image. The information describing this new bbox is the metadata that we will eventually store so that later modules can work directly with the original image files.
To begin this next step, we first determine the main-body-text-bbox from diff:
# create a bbox for the entire diff image
out_bbox1 = [0, 0] + list(diff1.size)
# Set the side value so that the function
# knows which bbox dimension to "cut"
side_dict = {"left":0, "right":2}
# Set the outer edge of the page equal to cut,
# rather than the original outer edge from diff
# This creates a bbox that is cropped to include
# only the main text areas of diff
out_bbox1[side_dict[side1]] = cut1
Next, we use combine_bbox from cropfunctions.py to determine the main body bbox for the original image, orig. Recall here that orig_bbox contains the crop dimensions for diff from within the original image file.
The function uses orig_bbox to situate diff within the original image. It then adjusts the values from orig_bbox to accommodate for the differences between diff and its "cut" version, represented by out_bbox.
In other words, earlier we used our cropped version of the original image to determine the "cut" line separating main body text from marginalia. As a result, we find the location of that cropped version within the original image (orig_bbox) and adjust it based on our "cut" values (out_bbox). The resulting bbox (total_bbox) encompasses only the main-body text area of the original image.
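The coordinate translation performed by combine_bbox amounts to shifting the inner bbox by the outer bbox's top-left corner. The sketch below (combine_bbox_sketch, a hypothetical stand-in for the version in cropfunctions.py) illustrates the idea:

```python
def combine_bbox_sketch(outer_bbox, inner_bbox):
    # inner_bbox is expressed in the coordinates of the crop defined by
    # outer_bbox; shifting it by outer_bbox's top-left corner re-expresses
    # it in the coordinates of the original, uncropped image.
    left, top = outer_bbox[0], outer_bbox[1]
    return (left + inner_bbox[0], top + inner_bbox[1],
            left + inner_bbox[2], top + inner_bbox[3])
```

Because only the top-left corner matters for the shift, the outer bbox's right and bottom edges play no role in the translation itself.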
total_bbox1 = combine_bbox(orig_bbox1,out_bbox1)
The image below shows each of our original page images, cropped to include only the main body text with no marginalia. The middle page retains its header information because it is a "section_start" page.
After we've determined the angle, side, and cut values, we add them along with the associated image's filename, the RGB color values of the image's average background color (background), and the image's total_bbox dimensions to a list of metadata for each file. This list is then added to our overall meta list (see beginning of the "Gathering unedited image files" section). In the production code that loops through all images in a batch, meta will contain metadata for all images in each batch.
meta_list1 = [master[0]["filename"], ang1, side1, cut1]
meta_list1.extend(background1)
meta_list1.extend(total_bbox1)
meta.append(meta_list1)
We then output the compiled metadata for all of our images to a .csv file.
headers = ["file", "angle", "side", "cut", "backR", "backG", "backB",
           "bbox1", "bbox2", "bbox3", "bbox4"]
with open(os.path.join("Output", "marginalia_metadata_demo.csv"), "a", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(headers)
    for row in meta:
        writer.writerow(row)
The output file contains the following fields:

file, angle, side, cut, backR, backG, backB, bbox1, bbox2, bbox3, bbox4

The purpose of each field is as follows:

- file: the filename of the original image
- angle: the de-skew angle determined by rotation_angle
- side: the side of the open book on which the page appears ("left" or "right")
- cut: the pixel coordinate of the line separating main body text from marginalia
- backR, backG, backB: the RGB values of the image's median background color
- bbox1 through bbox4: the coordinates of total_bbox, the main-body-text area within the original image
To recap, marginalia_determination consists of the following steps. Associated functions from cropfunctions.py are listed next to each step:
1. De-skewing (rotation_angle and find_score)
2. Trimming and header removal (trim)
3. Banding and cutting (get_bands and simp_bd)
4. Mapping the crop back onto the original image (combine_bbox)

The left image below shows an original page image, orig, with orig_bbox outlined in red. On the right you can see diff, which results from both the de-skewing (rotation_angle and find_score) and trim processes. Essentially, diff is the orig_bbox area from orig after being de-skewed and binarized for color.
The left image below shows diff once again, this time with the bands from get_bands outlined in orange. The cut line from simp_bd is outlined in blue. The right image shows the final crop, after the information stored in out_bbox has been converted to work with orig via the combine_bbox function. This image represents orig cropped to the dimensions stored in total_bbox. After these operations, the final boundary box along with other metadata (see above) is stored in the output file.